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Abstract 

We study the problem of prediction for 
evolving graph data. We formulate the 
problem as the minimization of a con- 
vex objective encouraging sparsity and low- 
rank of the solution, that reflect natu- 
ral graph properties. The convex formu- 
lation allows to obtain oracle inequalities 
and efficient solvers. We provide empiri- 
cal results for our algorithm and compari- 
son with competing methods, and point out 
two open questions related to compressed 
sensing and algebra of low-rank and sparse 
matrices. 



1. Introduction 

We study the prediction problem where the obser- 
vation is a sequence of graphs adjacency matrices 
(A t )o<t<T and the goal is to predict At+i- This 
type of problem arises in applications such as rec- 
ommender systems where, given information on pur- 
chases made by some users, one would like to predict 
future purchases. In this context, users and products 
can be modeled as nodes of a bipartite graph, while 
purchases or clicks are modeled as edges. 

In functional genomics and systems biology, estimat- 
ing regulatory networks in gene expression can be 
performed by modeling the data as graphs and fit- 
ting predictive models is a natural way for estimat- 
ing evolving networks in these contexts. 

A large variety of methods for link prediction 
only consider predicting from a single static 



snapshot of the graph - this includes heuris- 
tics (Liben-Nowell & Kleinberg, 2007; Sarkar et al., 
2010), matrix factorization (Koren, 2008) or prob- 
abilistic methods (Taskar et al., 2003). More re- 
cently, some works have investigated using sequences 
of observations of the graph to improve the predic- 
tion, such as using regression on features extracted 
from the graphs (Richard et al., 2010), using ma- 
trix factorization (Koren, 2010), or in some spe- 
cial cases probabilistic techniques. Most techniques, 
however, do not explicitly take into account the in- 
herently sparse nature of usual sequences of adja- 
cency matrices. In this work, we extend the work 
of (Richard et al., 2010) to address this and propose 
in addition a more principled way of predicting us- 
ing features extracted from the sequence of graph 
snapshots. 

We make the following assumptions about the graph 
sequence (represented by adjacency matrices A t ): 

1. Low- rank. A t has low rank. This reflects the 
presence of highly connected groups of nodes 
such as communities in social networks. 

2. Autoregressive linear features. We assume 
given a linear map lo : W ixn M. d defined by a 
set of n x n matrices (fli)i<i<d 

uj(S)= ($li,S),— ,{Sl d ,S)\ (1) 

such that the vector time series u>(A t ) has an 
autoregressive evolution: 

u(A t+1 ) = W T u(At)+N t 

where Wo £ M. dxd is a sparse matrix such that 
(oj(At))t>o is stationary. An example of linear 
features is degrees that is a popularity measure 
in social and commerce networks. 



2. Formulation of an optimization 
problem 

In order to reflect the stationarity assumption on 
u(At) we use a convex loss function 

£ : R d x R d ^ R + 

to penalize the dissimilarity between two feature vec- 
tors at successive time steps. Let us introduce 



3. Oracle inequality 

Define (<5,e T ) = [ X T ,0) - <$>(A T +i,W ), i.e. 



and 



5 = — Xj_iWo 



e = WJlj(A t ) - lj(A t +i) ■ 



and 



Xt-i 



coiA.y 
\u(a t -i) t J 



u{A 2 y 



pTxci 



„Txd 



\uj(A T yJ 

We also use £ to design the elementwise extension of 
t to Xs. In case of quadratic loss, we consider the 
following £i penalized regression objective : 



We define M = J2i=i where fli are defined in 
(1) and let 

H = Xf_ x 5- uj(A T )e T . 

We defined M and S such that they verify 

((8,e),$(S,W)) = ((M,E),(S,W)) 

The following result can be proved using the tools 
introduced in (Koltchinskii et al., 2011). 

Proposition 1. Let (S, W) be the minimizers of 
C(S, W) over a convex cone S x W. Suppose that 



Ji(w) = -||XT-iW-XT||i + /s||w||i . 

a 

To predict At+i, we propose a regression term pe- 
nalized by the sum of £\ and trace norm in the same 
fashion as in (Richard et al., 2012) in order to pre- 
dict the future graph At+i given the prediction of its 
features uj(At) t W should approximate uj(S) well: 

MS,W) = i||w(A T )W-w(5)|||.+r||S||.+7||5|| 1 

The overall objective function we consider here is the 
sum of the two partial objectives J\ and J 2l which 
is convex as J\ and J 2 are both convex. 



1. for some fi > 0, and for any Si,S 2 € S and 
Wi,W 3 € W, 



-\\$(Si -S 2 ,W X -W 2 ) 



>fi~'[ \\S 1 -S 2 \\ z F + T\\W 1 -W 2 r F 



2. r > ^\\M\\ op , 7 > 2 -^\\M\U k > 2||S|| 
for any real number a £ (0; 1); 

then 



£{S, W) = -||X T -iW - Xrlla + « ||W||i 

+ ±\HA T yw~u(sy\\l + T \\s\\, + j\\s\\ 



Let us introduce the linear map <E> defined by 
$(5, W) = { X T _iW, uj{Sy - uj{A T yW 



The objective can be written as a penalized least 
squared regression on the joint variable (5*, W) : 

c(s,w) = h\$(s,w)- (x T ,o 



\S-At+i\\ 2 f + T\\W-W \\ 2 f < 

V2 + 1 



+1 \\s\\ 1 + T\\s\u + \\wh 



H 2 min I ^- ^Vrank(A T +i) ^ + 1 +j^/\\A T+ i\\ 

2 2 

+^-||Wo|| 0j 2r||Ar + i||*+27||A T +i||i+2«||Wo||i 



The latter inequality shows how the quality of the 
solution is bounded by the rank and sparsity of the 
future graph At+i, and the interplay between these 
two prior through the parameter a. The dependence 
in T quantifies the improvement of the estimation in 
terms of the number of observations. 



4. Algorithms 



4.1. Generalized forward- backward 
algorithm for minimizing C 

We use the algorithm designed in (Raguet et al., 
2011) for minimizing our objective function. Note 
that this algorithm outperforms the method intro- 
duced in (Richard et al., 2010) as it directly min- 
imizes C jointly in (S, W) whereas the previous 
method first estimates W by minimizing a functional 
similar to J\ and then minimizes C(S,W). 

In addition to this we use the novel joint penalty 
from (Richard et al., 2012) that is more suited for 
estimating graphs. The proximal operator for the 
trace norm is given by the shrinkage operation, if 
Z = U diag(<7i, ■ • • , a n )V T is the singular value de- 
composition of Z 1 

prox T|MU (Z) = C/diag((a l - r) + \V T . 

Similarly, the proximal operator for the ^i-norm is 
the soft thresholding operator defined by using the 
entry- wise product of matrices denoted by o: 



prox 



sgn(Z) o {\Z\ - i). 



The algorithm converges under very mild conditions 
when the step size 9 is smaller than where L is 
the operator norm of $. 



Algorithm 

Minimize C 



1 Generalized Forward-Backward to 



Initialize S, Z 1: Z 2 , W, q = 2 
repeat 

Compute (Gs,G w ) = Vs,w$(S, W). 
Compute Z\ = prox 9 g T | | t (2S' — Z\ — 
Compute Z 2 = prox g6 , 7 || || 1 (2S - Z 2 - 

Set W = proxg KMl (W - 9G W ) 
until convergence 
return (S, W) minimizing C 



0G S ) 
0G S ) 



J(U,V,W) = -WXt^W -X T \\% 



-\\uj(A t 
a 



+ l{\\U\\i + 



V\ 



which is a non-convex function of the joint variable 
(f, V, W), making the theoretical analysis more dif- 
ficult. Given that the objective is convex in a neigh- 
borhood of the solution, by initializing the variables 
adequately, we can write an algorithm inspired by 
proximal gradient descent for minimizing it. 

Algorithm 2 Minimize J 



Initialize U, V, W 
repeat 

Compute (Gjj, Gv, Gw) 



u.v,w 



<f>(UV T ,W). 



Set C7 = prox 07 || ^{U-9G V ) 
Set F = prox 07 || lh (V-9G v ) 
Set W = W ox 9kMi (W - 6G W ) 

until convergence 

return (U, V, W) minimizing J 



5. Numerical Experiments 

5.1. A generative model for graphs having 
linearly autoregressive features 

Let Vq € W nxr be a sparse matrix, Vq its pseudo- 



v o v o 



inverse such, that Vq Vq 
two sparse matrices Wq g 
. Now define the sequence of matrices (A t ) t >Q for 
t = l,2,.-- by 



I r . Fix 

rxr and U G R nxr 



U t = U t -iWo + N t 



and 



A t = U t V Q T + M t 

for i.i.d sparse noise matrices N t and M t , which 
means that for any pair of indices with high 

probability (N t )i,j = and {M t \j = 0. 



If we define the linear feature map to (A) 
note that 



AV r 



Tt 



4.2. Non-convex Factorization Method 

An alternative method to the estimation of low-rank 
and sparse matrices by penalizing a mixed penalty of 
the form t\\S\\* +7||5||i as in (Richard et al., 2012) 
is to factorize S = UV T where U,V G M' iXr are 
sparse matrices, and penalize 7(||t/||i + HV^Ii). The 
objective function to be minimized is 



1. The sequence u>(At) 



= U, 



M t F Tt 



follows the linear autoregressive relation 
uj(A t ) T - uiAt^yWo + N t + M t V Tl . 

2. For any time index t, the matrix A t is close to 
UtVo that has rank at most r 

3. At is sparse, and furthermore Ut is sparse 



5.2. Results 

We tested the presented methods on synthetic 
data generated as in section (5.1). In our ex- 
periments the noise matrices Mt and Nt where 
built by soft-thresholding iid noise J\f(0,a 2 ), 
n = 50, T = 10, r = 5, d = 10, a = .5. After choosing 
the parameters t, 7, r by 10-fold cross-validation, we 
compare our methods to standard baselines in link 
prediction (Liben-Nowell & Kleinberg, 2007). We 
use the area under the ROC curve as the measure 
of performance and report empirical results aver- 
aged over 10 runs. Nearest Neighbors (NN) relies 
on the number of common friends between each pair 
of nodes, which is given by A 2 when A is the cumu- 
lative graph adjacency matrix At = St=o ^* an< ^ 
we denote by Shrink the low-rank approximation of 
At- Since Vq is unknown we consider the feature 
map u)(S) = SV where A T = UY,V T is the SVD of 
At- 



Method 


AUC 


NN 
Shrink 
rain C 
m\nj 


0.8691 ± 0.0168 
0.8739 ± 0.0169 
0.9094 ± 0.0176 
0.9454 ± 0.0087 



Table 1. Performance of algorithms in terms of Area Un- 
der the ROC Curve. 



6. Discussion 

The experiments suggest the empirical superiority of 
the proposed approaches to the standard baselines. 
It is very intriguing that the non-convex matrix fac- 
torization outperforms the convex rival. A possible 
explanation is that minimizing the nuclear norm by 
using the shrinkage operator results in factorizations 
of the solution by two orthogonal matrices, which 
conflicts with the sparsity of the solution. The other 
benefit of the non-convex formulation is its scalabil- 
ity, as the proximal method proposed for the convex 
formulation scales in 0(n 2 ) in storage and 0(n 3 ) in 
time. Several questions open perspectives for further 
investigations. 

1. Choice of the feature map uj. In the current 
work we used the projection onto the vector 
space of the top-r singular vectors of the cu- 
mulative adjacency matrix as the linear map 
omega, and this choice has shown empirical 
superiority to other choices. The question of 
choosing the best measurement to summarize 
graph information as in compress sensing seems 



to have both theoretical and application poten- 
tial. 

2. Characterization of sparse and low-rank matri- 
ces. Can all the sparse and low-rank matrices 
S be written as S = UV T where U, V € R" xr 
are both sparse? Or in other terms, what is the 
relation between the solution of problems pe- 
nalized by || f ||i + ||V||i -such as J- and those, 
e.g. C, penalized by \\S\\i + /8||5||* ? 

[Appendix : Proof of proposition (1)] 
A. Preliminary Tools 

Definition 1 (Orthogonal projections associated 
with S). Let S £ R" x ™ be a rank r matrix. We can 
write the SVD of S in two ways: S = (TjUjVj 
or S = V"SV T , where U, V € ]R nxr are orthogonal 
and £ = diag(cri, ■ ■ • ,oy). Let U and V matri- 
ces of size nx (n — r) ortho-normally completing the 
bases of U and V, and define the projections onto 
the vector spaces spanned by vectors Ui and Vi for 
i = 1, • • • r: 

Pu = UU T , Pu± = [/- L C/ ±T 

P v = VV T , P v ± = V^V^ 
and define the orthogonal projection 

Vs{B) = B - Pu±BP v ± 

We highlight the fact that Vs(B) can also be written 
as 

T S (B) = PuBP v + PuBP v ± + P V ^BP V 

or 

Ps(B) = P u B + P u ±BP v 

We know that rank(7' s (S)) < rank(B) and 
i&nk{PijBPv) < rank(.B). The two following in- 
equalities will also be useful: 

Lemma 1 (Rank inequalities). For any matrix B, 

1. rank(7? s (B)) < 2 rank(5) 

2. rank( P V BP V ) < rank(S) 

Lemma 2 (Orthogonality of the decomposition). 
For any matrix B, with the same notations, we have 

B = P ;J ±BP V ± + P V BP V + P V BP V ± + Pu±BP v 

and the 4 terms are pairwise orthogonal. It follows 
that 



1. We have the identity 

\\B\\% = \\P U ±BP V ±\\ 2 F + \\P U BP V \\ 2 F 

+ \\P U BPy,\\l + \\P U ,BP V \\ 2 F 

2. It follows that \\PuBP v \\f <\\V s {B)\\ f 
B. Proof 

We have for any (5*, W) € S x W, by optimality of 
(S,W): 



\\§(S-A T+1 ,W-W )\\ 2 F 



\mS-A T+1 ,W-W )\\ F 



= ~(\ms,w)\\ F -\ms,w)\\ F 

2(<5>(S -S,W- W), <f>{A T+ i, W )} 



< ~($(S-S,W-W), (X T ,0j -$(A T+1 ,W )) 

+r(||S||,-||S||.)+7(||5||i-||5||i)+/c(||W||i-||W||i) 

= ~{@-S,W-W),(M,8)) 
+r(||S||.-||5||,)+7(||S|| 1 -||5|| 1 )+«(||W||i-||W||i) 



(3) 



Thanks to trace-duality and ^-duality we have 
(M,X) < HA/IWlXHi and (M,X) < ||M|| op ||X||* 
for any X, so for any a G [0; 1], 



For proving the other bound, we start by set- 
ting some notations. Let S G S, and let 
r = rank(S), k = ||S|| , q = \\W\\ . Let 
S = E/diag(<7i,--- ,a r )V T be the SVD of S 
and let S = 6 5 o W = Ow ° \W\ where 
9 S G {0,±l}' iXn , Q w G {0,±l} dxd are sign matri- 
ces of S, and W such that j|6s||o = k, \\&w\\o = q 
and o is the entry- wise product. Let 

Z = tZ* + 7Z1 

= t ( J2 UjV]+P U± G*P V A^ +7 (Qs+G^ 

denote an element of the subgradient of the convex 
function S H> t||5||* +7||5||i, so ||G*|| op < 1 and 
||Gi||oo < 1. There exist G\ and G* such that 



(Z, S-S)= t(J2 Ujv],S -S) + t\\P u± SP v± \U 

3 = 1 

+ ~f(e s ,s-s) + 7 ||e^o5||i 

We use the two standard inequalities of convex func- 
tion subdifferentials (d£(S, W), (S-S, W-W)) < 
and (S — S, Z — Z) > and a similar inequality on 
subdifferentials on W and W of W H> ||W||i, de- 
noted by Q and Q. We get 

(dC{S, W), (S — S,W — W)) 

-{Z-Z,S-S)-(Q-Q,W-W)<0 (5) 

Therefore we obtain 



-\mS-A T+1 ,W-W )\\% 
a 

< ~\\$(S-A t+1 ,W-W )\\ 2 f + t\\S\U 

-T\\S\U + 2a\\S-S\\4M\\ op 
+ 7||5 , ||i-7||5||i+2(l-a)||5-5'|| 1 ||M|| 00 
+ k\\W\\ 1 -k\\W\\ 1 + 2\\W-W\\ 1 \\Z\\ 00 (4) 

now using assumptions t > ^||M|| OJ ,, 

7 > 2(1 rf '" ) ||Af Hooa and k > §||S||oo and then 
triangular inequality 



1 



\\$(S-A T+1 ,W-W )\\ 2 F < 



|*(5-A r+ i,W-Wb)|||.+2r|| ) S||.+27|| ) S||i+2«||W r || ] 

□ 



(V( S ,w)\\*(S, W)- f X T , Oj HI, (S-S, W-W)) = 

2((S, e),$(S - S,W - W)) 
- 2{$(S - A T+1 ,W - W ),$(S-S,W- W)) 

The inequality (5) can be written as 

-M(S - A T+1 ,W - W ),$(S - S,W - W)) < 
a 

h(8,e),*(S-S,W-W)) 
a 

r 

- T (J2 Uj v],S-S)-T\\v£(S)\U 

3=1 

-k{G w ,W-W)-k\\G^oW\\i (6) 



Thanks to Cauchy-Schwarz 



C auchy- S chwar z 



| (J2 Ui vj , S - S) | < Vf 1 1 P v (S - S)P V ! | p 

3 = 1 



similarly 



(M, S - S) 

<a( V2rp/|| op ||7's(5-5)||F 



and 



|(e Sj 5-5>| < \^||e s o(s-5)||i. 



\(Q w ,w-w}\ < y/q\\e w °(w-w)\\ F 



+ \\M\\ op \\P u ±SP v ±\\* 

+ (l-o)(>/A||M|| 00 ||e s o(5-5)ll 

+ ||M|| 00 ||e£o5||i) (8) 



so we have 



-($(S- A T +i,W- W ),<f>(S - S,W -W)) < 



Now by using 



2($(5 - A T+U W - W ), 1>(S -S,W- W)) 



'-((5, e), $(S -S,W-W)) + tV¥\\Pu(S - S)P v \\f ||$(s - A T+1 ,W - W )f 2 + \\$(S - S,W - W)\\\ 



-T\\p u± sp v± \u+"/Vk\\e s °(s-s)\\ F -j\\ejos\\ 1 

+ K^q\\Q w o(W-W)\\ F -K\\e^oW\\ 1 (7) 

We need to bound ((5, e), $(S - 5, W - W)). For 
this, note that by definition, 

({6,e),®(S-S,W-W)} = ((M,E),(S-S,W-W) 
and decompose for any a G [0, 1] 

M = 



a\V s {M)+P u ±MP v ±]+(l-a) [e s °M+e^oM 

We get by applying triangle inequality, Cauchy- 
Schwarz, Holder inequality written for the trace- 
norm and £i-norm 



(M, S-S) 

<a[ \\Vs(M)\\p\\V s {S - S)\U 



\\$(S-A T+1 ,W-Wo)\\ 



we can rewrite the inequality (7) as follows: 



-h\$(S-A T +i,W-W )\\l 

+ \\$(s-s,w- w)\\l - 11^(5-^+0111 



2a 



< — (^V2 r\\M\\ op \\Vs(S-S)\\ F +\\ M \\op\\Pu^SPv- 
+ ^-j^(Vk\\M\\ oc \\S-S\\ F + \\M\\ 00 \\ejoS\\ 1 



~\jq\\z\\ oc \\w-w\\ F + usiiooiie^o^iii 

+ tV^\\Pu(S - S)P V \\ F - t\\Pu^SPv±\\* 

+ Wk\\e s o{s-S)\\F-'r\\&8 o S\\i 
+ K^q\\e w o (w - w)\\ F - K \\eh o w\\i 

< V^(^a\\M\\ op + r\\S-S\\ F 



+ \\P u ±MPv±\\ op \\Pb±SPv±\\ 
+ (l-a)(\\e s oM\\ F \\Q 3 o(S-S)\\ F 

+ ||e^oM|| 00 ||e^o5|| 



2(1 - a) 



\\M\\oo + J )\\S-8\\ 



+ ^(~\\Z\\ 00 + K \\W-W\\ F (9) 



the last inequality being due to the assumptions 
r > f HMHop, 7 > ^l|M|U and 7 > 



by rank and support inequalities obtained again by So finally, and by using again these assumptions, 



\<f>(S-A T+1 ,W-Wo)\\l+\\<f>(S-S, W-W)\\ 



l\\$(S-A T+1 )\\l+ (v^r(V2+l) + 2Vkj) \\S-S\\ F 



+ 2^k||VF- W\\ f 
< \\$(S-A T+1 ,W-W )\\l 

+ (yFr(V2+ 1) + 2Vk^ \\§(S- S, W - W)\\ 



\\<f>(S - S,W -W)\\ F (10) 



and bx — x 2 < (|) gives 



-\\$(S-A T+1 ,W-W )\\l 



< -\\$(S-At+i,W-W )\\1 
a 



Liben-Nowell, D. and Kleinberg, J. The link- 
prediction problem for social networks. Journal 
< of the American society for information science 
and technology, 58(7):1019-1031, 2007. 

Raguet, H., Fadili, J., and Peyre, G. General- 
ized forward-backward splitting. Arxiv preprint 
arXiv.HO8.4404, 2011. 

Richard, E., Baskiotis, N.. Evgeniou, Th., and Vay- 
atis, N. Link discovery using graph feature track- 
ing. Proceedings of Neural Information Processing 
Systems (NIPS), 2010. 

Richard, E., Savalle, P.- A., and Vayatis, N. Estima- 
tion of simultaneously sparse and low-rank matri- 
ces. In Proceeding of 29th Annual International 
Conference on Machine Learning, 2012. 

Sarkar, P., Chakrabarti, D., and Moore, A.W. The- 
oretical justification of popular link prediction 
heuristics. In International Conference on Learn- 
ing Theory (COLT), pp. 295-307, 2010. 

Taskar, B., Wong, M.F., Abbeel, P., and Roller, D. 
Link prediction in relational data. In Neural In- 
formation Processing Systems, volume 15, 2003. 



and the result follows by using 



-Ms 1 -s 2 ,w 1 -w 2 )\\ 2 2 

> /i-lSi -S 2 \\% + iT 2 T\\Wx - W 2 f F 
and setting (S, W) = (A T+1 ,W ): 



\\S-A T+1 \\ 2 F + T\\W-W \\ 2 F 

<^(Vr(V2 + l) + 2Vfc 7 ) 2 + ^ □ 
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